uni <- read.table(file="/Volumes/SD/GoogleDrive/0.DA/Assignments/2/uni.csv",
header=TRUE, sep=";", fill = TRUE, quote="\"")
head(uni)
## Name Private Apps Accept Enroll Top10perc
## 1 Abilene Christian University Yes 1660 1232 721 23
## 2 Adelphi University Yes 2186 1924 512 16
## 3 Adrian College Yes 1428 1097 336 22
## 4 Agnes Scott College Yes 417 349 137 60
## 5 Alaska Pacific University Yes 193 146 55 16
## 6 Albertson College Yes 587 479 158 38
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD
## 1 52 2885 537 7440 3300 450 2200 70
## 2 29 2683 1227 12280 6450 750 1500 29
## 3 50 1036 99 11250 3750 400 1165 53
## 4 89 510 63 12960 5450 450 875 92
## 5 44 249 869 7560 4120 800 1500 76
## 6 62 678 41 13500 3335 500 675 67
## Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 1 78 18.1 12 7041 60
## 2 30 12.2 16 10527 56
## 3 66 12.9 30 8735 54
## 4 97 7.7 37 19016 59
## 5 72 11.9 2 10922 15
## 6 73 9.4 11 9727 55
We will first check the distribution of each variable and correct possible outliers.
summary(uni)
## Name Private Apps
## Abilene Christian University: 1 NB : 4 Min. : 81
## Adelphi University : 1 No :212 1st Qu.: 776
## Adrian College : 1 Yes:561 Median : 1558
## Agnes Scott College : 1 Mean : 3002
## Alaska Pacific University : 1 3rd Qu.: 3624
## Albertson College : 1 Max. :48094
## (Other) :771
## Accept Enroll Top10perc Top25perc
## Min. : 72 Min. : 35.0 Min. : 1.00 Min. : 9.0
## 1st Qu.: 604 1st Qu.: 243.0 1st Qu.:15.00 1st Qu.: 41.0
## Median : 1110 Median : 437.0 Median :23.00 Median : 54.0
## Mean : 2019 Mean : 794.1 Mean :27.56 Mean : 55.8
## 3rd Qu.: 2424 3rd Qu.: 904.0 3rd Qu.:35.00 3rd Qu.: 69.0
## Max. :26330 Max. :7157.0 Max. :96.00 Max. :100.0
##
## F.Undergrad P.Undergrad Outstate Room.Board
## Min. : 139 Min. : 1.0 Min. : 2340 Min. :1780
## 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320 1st Qu.:3597
## Median : 1707 Median : 353.0 Median : 9990 Median :4200
## Mean : 3700 Mean : 855.3 Mean :10441 Mean :4358
## 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925 3rd Qu.:5050
## Max. :31643 Max. :21836.0 Max. :21700 Max. :8124
##
## Books Personal PhD Terminal
## Min. : 96.0 Min. : 250 Min. : 8.00 Min. : 24.0
## 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00 1st Qu.: 71.0
## Median : 500.0 Median :1200 Median : 75.00 Median : 82.0
## Mean : 549.4 Mean :1341 Mean : 72.66 Mean : 79.7
## 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00 3rd Qu.: 92.0
## Max. :2340.0 Max. :6800 Max. :103.00 Max. :100.0
##
## S.F.Ratio perc.alumni Expend Grad.Rate
## Min. : 2.50 Min. : 0.00 Min. : 3186 Min. : 10.00
## 1st Qu.: 11.50 1st Qu.:13.00 1st Qu.: 6751 1st Qu.: 53.00
## Median : 13.60 Median :21.00 Median : 8377 Median : 65.00
## Mean : 14.22 Mean :22.74 Mean : 9660 Mean : 65.47
## 3rd Qu.: 16.50 3rd Qu.:31.00 3rd Qu.:10830 3rd Qu.: 78.00
## Max. :122.10 Max. :64.00 Max. :56233 Max. :118.00
## NA's :3
We first notice that the Private column is encoded as Yes, No, or NB. We will convert Yes to 1; all other values become 0.
uni$Private <- as.integer(uni$Private == "Yes")
head(uni)
## Name Private Apps Accept Enroll Top10perc
## 1 Abilene Christian University 1 1660 1232 721 23
## 2 Adelphi University 1 2186 1924 512 16
## 3 Adrian College 1 1428 1097 336 22
## 4 Agnes Scott College 1 417 349 137 60
## 5 Alaska Pacific University 1 193 146 55 16
## 6 Albertson College 1 587 479 158 38
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD
## 1 52 2885 537 7440 3300 450 2200 70
## 2 29 2683 1227 12280 6450 750 1500 29
## 3 50 1036 99 11250 3750 400 1165 53
## 4 89 510 63 12960 5450 450 875 92
## 5 44 249 869 7560 4120 800 1500 76
## 6 62 678 41 13500 3335 500 675 67
## Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 1 78 18.1 12 7041 60
## 2 30 12.2 16 10527 56
## 3 66 12.9 30 8735 54
## 4 97 7.7 37 19016 59
## 5 72 11.9 2 10922 15
## 6 73 9.4 11 9727 55
require(ggplot2)
require(data.table)  # provides melt() for reshaping the data to long format
ggplot(data = melt(uni), mapping = aes(x = value)) +
  geom_histogram(bins = 20) +
  facet_wrap(~variable, scales = 'free_x')
Several columns (P.Undergrad, S.F.Ratio and Books) contain unusual values, which we will handle by replacing them with the column medians. These values might be genuine, but they are such extreme outliers that replacing them is a reasonable choice.
uni$Books[(uni$Books > 1500)] <- median(uni$Books, na.rm=TRUE)
uni$P.Undergrad[(uni$P.Undergrad > 20000)] <- median(uni$P.Undergrad, na.rm=TRUE)
uni$S.F.Ratio[(uni$S.F.Ratio > 100)] <- median(uni$S.F.Ratio, na.rm=TRUE)
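Before fixing missing values, it is worth confirming which columns actually contain them. A quick check (run before any imputation):

```r
# Count missing values per column and show only the columns that have any
na_counts <- colSums(is.na(uni))
print(na_counts[na_counts > 0])
```

Based on the summary output above, only Grad.Rate should appear here, with 3 missing values.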
We also have to fix NA values:
uni$Grad.Rate[is.na(uni$Grad.Rate)] <- median(uni$Grad.Rate, na.rm=TRUE)
Next, we remove non-numeric columns, such as the university name, from the data frame:
uni = uni[,-1]
Now we can start the analysis by printing and plotting the correlations of all variables with the number of applications (Apps):
print(cor(uni)[,2])
## Private Apps Accept Enroll Top10perc Top25perc
## -0.42196498 1.00000000 0.94345057 0.80722854 0.33883368 0.35163990
## F.Undergrad P.Undergrad Outstate Room.Board Books Personal
## 0.81449058 0.41478874 0.05015903 0.16493896 0.16083904 0.17873085
## PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 0.39069733 0.36949147 0.09529911 -0.09022589 0.25959198 0.14682609
We can also create a barplot of the correlations:
barplot(cor(uni)[,2], width = 30)
An important trend to notice is that most of the variables are positively correlated with the number of applications; the one notable exception is whether the university is private. Another, smaller exception is the perc.alumni column, meaning that schools where more alumni donate receive fewer applications.
Now we are going to fit a linear model for every single predictor in X (i.e. each variable/column) to predict the response y (Apps).
We will also create a plot for each of them showing its predictive capabilities.
It's important to understand that we don't expect very good results from this experiment, because each model uses only a limited amount of information.
We do expect that positively correlated predictors will have upward slopes, while negatively correlated predictors (such as Private) will have downward slopes.
for (i in names(uni[,-2])) {
  mod1 <- lm(Apps ~ get(i), data = uni)
  x <- uni[, c(i)]
  y <- uni$Apps
  plot(x, y, xlim = c(min(x) - 1, max(x) + 1), ylim = c(min(y) - 10, max(y) + 10),
       xlab = i, ylab = 'applications')
  abline(mod1, lwd = 2)
}
We can see that variables with a negative correlation to the number of applications also had downward (negatively) sloped regression lines (i.e. negative coefficients).
On the other hand, all positively correlated variables had upward-sloping models (i.e. positive coefficients).
Now we will try to predict the number of applications using all of the variables (predictors).
First we will fit a linear model.
Then we will analyze its coefficients to determine which predictors are useful, followed by the correlations between predictors.
lm.fit = lm(Apps~., data=uni)
After fitting the model we will explore its internals and metrics:
summary(lm.fit, correlation=TRUE)
##
## Call:
## lm(formula = Apps ~ ., data = uni)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5005.3 -433.0 -34.0 324.2 8702.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -570.62890 410.82396 -1.389 0.165244
## Private -495.26757 136.23366 -3.635 0.000296 ***
## Accept 1.51537 0.03738 40.535 < 2e-16 ***
## Enroll -0.28673 0.10907 -2.629 0.008742 **
## Top10perc 47.27397 5.61554 8.418 < 2e-16 ***
## Top25perc -13.05656 4.52120 -2.888 0.003989 **
## F.Undergrad -0.01738 0.02575 -0.675 0.500052
## P.Undergrad 0.02050 0.03925 0.522 0.601573
## Outstate -0.08646 0.01921 -4.501 7.82e-06 ***
## Room.Board 0.17176 0.04871 3.526 0.000447 ***
## Books 0.20364 0.28106 0.725 0.468963
## Personal 0.02556 0.06419 0.398 0.690639
## PhD -7.89128 4.65808 -1.694 0.090655 .
## Terminal -3.80930 5.14388 -0.741 0.459196
## S.F.Ratio 14.07978 13.14300 1.071 0.284386
## perc.alumni -0.38949 4.17088 -0.093 0.925623
## Expend 0.07741 0.01249 6.196 9.48e-10 ***
## Grad.Rate 7.94597 2.98554 2.661 0.007944 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1053 on 759 degrees of freedom
## Multiple R-squared: 0.9276, Adjusted R-squared: 0.9259
## F-statistic: 571.6 on 17 and 759 DF, p-value: < 2.2e-16
##
## Correlation of Coefficients:
## (Intercept) Private Accept Enroll Top10perc Top25perc
## Private -0.28
## Accept 0.14 0.04
## Enroll -0.09 -0.03 -0.34
## Top10perc 0.27 -0.01 0.12 -0.06
## Top25perc -0.19 0.00 -0.06 0.05 -0.80
## F.Undergrad 0.02 0.18 -0.41 -0.60 -0.05 -0.04
## P.Undergrad -0.01 0.04 0.07 0.09 0.08 -0.01
## Outstate 0.06 -0.33 -0.17 0.05 -0.05 0.00
## Room.Board -0.17 -0.09 -0.10 0.03 0.04 0.01
## Books -0.22 -0.05 -0.01 -0.01 -0.04 -0.04
## Personal -0.28 0.00 0.06 0.01 -0.01 0.01
## PhD -0.01 0.12 -0.02 -0.04 -0.12 -0.01
## Terminal -0.25 0.10 0.01 0.06 0.11 -0.12
## S.F.Ratio -0.60 0.15 -0.02 0.01 0.00 0.03
## perc.alumni -0.04 -0.10 0.16 -0.09 -0.02 -0.06
## Expend -0.22 0.08 -0.04 -0.01 -0.38 0.21
## Grad.Rate -0.25 -0.07 -0.13 0.04 -0.07 -0.08
## F.Undergrad P.Undergrad Outstate Room.Board Books Personal
## Private
## Accept
## Enroll
## Top10perc
## Top25perc
## F.Undergrad
## P.Undergrad -0.33
## Outstate 0.09 0.01
## Room.Board 0.05 -0.14 -0.34
## Books 0.00 0.00 0.04 -0.10
## Personal -0.11 -0.09 0.07 0.07 -0.21
## PhD 0.04 -0.05 -0.06 0.02 0.09 -0.04
## Terminal -0.08 -0.02 -0.09 -0.11 -0.09 0.04
## S.F.Ratio -0.06 0.00 0.08 0.01 -0.05 0.07
## perc.alumni -0.01 0.10 -0.20 0.15 0.02 0.09
## Expend -0.01 0.01 -0.28 -0.11 -0.04 -0.05
## Grad.Rate 0.01 0.13 -0.14 -0.13 0.04 0.07
## PhD Terminal S.F.Ratio perc.alumni Expend
## Private
## Accept
## Enroll
## Top10perc
## Top25perc
## F.Undergrad
## P.Undergrad
## Outstate
## Room.Board
## Books
## Personal
## PhD
## Terminal -0.73
## S.F.Ratio -0.06 0.02
## perc.alumni 0.01 -0.08 0.06
## Expend -0.02 -0.05 0.37 -0.02
## Grad.Rate -0.05 0.05 -0.03 -0.20 0.09
We can observe several things:
The Coefficients statistics tell us how significant each predictor is. A small p-value indicates that it is unlikely we would observe such a relationship between the predictor and the response due to chance alone. For predictors whose p-value is below 0.01 we can say that there is a relationship between the predictor (x) and the response (y); that is, we can reject the null hypothesis that there is no relationship between them.
Using the p-value metric, the most influential predictors are: Private, Accept, Enroll, Top10perc, Top25perc, Outstate, Room.Board, Expend and Grad.Rate. For all other predictors we fail to reject the null hypothesis that there is no relationship between the predictor and the response.
By analyzing the Correlation of Coefficients we can observe that, as expected, the predictors (and their coefficients) are not uncorrelated, but rather mostly slightly correlated.
The R-squared value of 0.9276 is a solid result, as it is close to the ideal score of 1. This value is somewhat optimistic because it is computed on the same data the model was trained on.
The F-statistic has a very low p-value, meaning that the null hypothesis that none of the predictors are useful is very unlikely; we can conclude that at least one of the predictors is useful in predicting the response.
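The significant predictors can also be pulled out of the coefficient table programmatically; a minimal sketch using the fitted `lm.fit` and the 0.01 threshold from above:

```r
# Extract the coefficient table from the model summary
coefs <- summary(lm.fit)$coefficients
# Keep the terms whose p-value is below 0.01, excluding the intercept
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.01]
print(setdiff(significant, "(Intercept)"))
```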
We can use the leaps package to find the best subset of 5 variables using forward feature selection.
library(leaps)
regfit.fwd=regsubsets(Apps~.,data=uni,nvmax=5,method="forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Apps ~ ., data = uni, nvmax = 5, method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## Private FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Outstate FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 5
## Selection Algorithm: forward
## Private Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " " " "
## 3 ( 1 ) " " "*" " " "*" " " " " " "
## 4 ( 1 ) " " "*" " " "*" " " " " " "
## 5 ( 1 ) " " "*" "*" "*" " " " " " "
## Outstate Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) " " "*" " "
## 5 ( 1 ) " " "*" " "
We can see that the top 5 features are chosen in the following order of importance:
1. Accept
2. Top10perc
3. Expend
4. Outstate
5. Enroll
Now we can also check the coefficients of the model:
coef(regfit.fwd, 5)
## (Intercept) Accept Enroll Top10perc Outstate
## -554.38069351 1.53453168 -0.33517131 32.13461106 -0.09057353
## Expend
## 0.08119256
To conclude, the forward feature selection procedure found the subset of most influential variables: Accept, Top10perc, Expend, Outstate and Enroll.
When analyzing the performance of linear models built on a single predictor we concluded that the results were highly affected by correlations between individual variables.
We will again compute the correlations between individual variables and filter out correlations with absolute value less than 0.35, as they are weak compared to the others.
d <- cor(uni)[,2]  # correlations of every variable with Apps
d <- d[-2]         # drop Apps' correlation with itself
d[abs(d) > 0.35]
## Private Accept Enroll Top25perc F.Undergrad P.Undergrad
## -0.4219650 0.9434506 0.8072285 0.3516399 0.8144906 0.4147887
## PhD Terminal
## 0.3906973 0.3694915
From this list we find that the most correlated single variables are: Private, Accept, Enroll, Top25perc, F.Undergrad, P.Undergrad, PhD, Terminal.
Now we will compare this with the most influential variables based on their standard errors and p-values from the multiple regression model. We already determined those variables above: Private, Accept, Enroll, Top10perc, Top25perc, Outstate, Room.Board, Expend and Grad.Rate.
Furthermore, using feature selection on the multiple regression model we determined that the 5 most informative variables were Accept, Top10perc, Expend, Outstate and Enroll.
We can notice that the variables Private, Accept, Enroll and Top25perc show up both in the simple single-variable correlations and in the more advanced models, suggesting that they are the predictors with the strongest relationship to the number of applications.
This makes sense, as the number of applicants accepted, the number of new students enrolled and the number of new students from the top 25% of their high school class are logically strongly correlated with the number of applications, because the students who applied to the university are also counted in those numbers.
To better evaluate our model we have to test it on data it wasn't trained on, because testing the model on the same data points it was trained on will likely produce unrealistically good results.
We will first split our data into a training dataset (75%) and a testing dataset (25%).
Using the training and testing split we will train our model on the 75% of the data and test it on the other 25%.
To analyze it’s performace we will compute the R-squared and RMSE metrics.
We will also evaluate the model based on all of the training data using 10-fold cross validation. This process is very useful for example to find the best hyperparameters of a model.
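For reference, the two metrics have simple closed forms; `caret`'s `RMSE()` and `R2()` compute essentially the following (a sketch with hypothetical vectors `obs` and `pred`; note that `R2()` defaults to the squared correlation rather than 1 - SSE/SST):

```r
# Root mean squared error: typical magnitude of a prediction error
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
# R-squared as caret computes it by default: squared correlation
# between the observed and predicted values
r2 <- function(obs, pred) cor(obs, pred)^2
```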
I will be using the caret and leaps packages for these calculations.
First we will split our dataset into training and testing data. Training will be done on 75% of the data and testing on the remaining 25%.
library(caret)
library(leaps)
set.seed(41)
train <- sample(seq_len(nrow(uni)), size = floor(0.75 * nrow(uni)))
test <- (-train)
Now let’s train a multiple linear regression model on our training data.
We will also summarize the model to quickly go over its metrics.
lm.fit = lm(Apps~.,data=uni[train,])
print(summary(lm.fit))
##
## Call:
## lm(formula = Apps ~ ., data = uni[train, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -5249.4 -414.5 -43.1 293.8 7510.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.258e+02 4.845e+02 -0.260 0.7953
## Private -6.143e+02 1.565e+02 -3.924 9.78e-05 ***
## Accept 1.639e+00 4.363e-02 37.565 < 2e-16 ***
## Enroll -9.879e-01 2.210e-01 -4.470 9.45e-06 ***
## Top10perc 3.944e+01 6.322e+00 6.239 8.66e-10 ***
## Top25perc -8.948e+00 5.096e+00 -1.756 0.0797 .
## F.Undergrad 6.114e-02 3.894e-02 1.570 0.1169
## P.Undergrad 6.027e-03 4.746e-02 0.127 0.8990
## Outstate -9.830e-02 2.235e-02 -4.398 1.31e-05 ***
## Room.Board 1.208e-01 5.658e-02 2.136 0.0331 *
## Books 1.224e-01 3.240e-01 0.378 0.7057
## Personal -3.030e-03 7.219e-02 -0.042 0.9665
## PhD -1.200e+01 5.368e+00 -2.236 0.0257 *
## Terminal -1.091e-01 5.895e+00 -0.019 0.9852
## S.F.Ratio 3.775e+00 1.627e+01 0.232 0.8166
## perc.alumni 1.730e-01 4.866e+00 0.036 0.9717
## Expend 9.355e-02 1.868e-02 5.008 7.37e-07 ***
## Grad.Rate 8.744e+00 3.489e+00 2.506 0.0125 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1033 on 564 degrees of freedom
## Multiple R-squared: 0.9368, Adjusted R-squared: 0.9349
## F-statistic: 491.8 on 17 and 564 DF, p-value: < 2.2e-16
Now we will evaluate the model on our testing data according to R-squared and RMSE metrics:
lm_pred <- predict(lm.fit, uni[test,])
lm_rmse <- RMSE(uni$Apps[test],lm_pred)
lm_r2 <- R2(uni$Apps[test],lm_pred)
sprintf('RMSE = %f',lm_rmse)
## [1] "RMSE = 1239.317749"
sprintf('r-squared = %f',lm_r2)
## [1] "r-squared = 0.864277"
Using 10-fold cross validation we can also estimate how good our model is by training it 10 times, each time on 90% of the training data, and testing on the remaining 10%:
train_control <- trainControl(method="cv", number=10, verboseIter = TRUE)
# train the model
model_lm_cv <- train(Apps~., data=uni[train,], trControl=train_control, method="lm")
## + Fold01: intercept=TRUE
## - Fold01: intercept=TRUE
## + Fold02: intercept=TRUE
## - Fold02: intercept=TRUE
## + Fold03: intercept=TRUE
## - Fold03: intercept=TRUE
## + Fold04: intercept=TRUE
## - Fold04: intercept=TRUE
## + Fold05: intercept=TRUE
## - Fold05: intercept=TRUE
## + Fold06: intercept=TRUE
## - Fold06: intercept=TRUE
## + Fold07: intercept=TRUE
## - Fold07: intercept=TRUE
## + Fold08: intercept=TRUE
## - Fold08: intercept=TRUE
## + Fold09: intercept=TRUE
## - Fold09: intercept=TRUE
## + Fold10: intercept=TRUE
## - Fold10: intercept=TRUE
## Aggregating results
## Fitting final model on full training set
We can now analyze the results of Cross Validation found in the results field of the trained model:
print(model_lm_cv$results)
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 1099.672 0.9264773 630.8736 399.7144 0.06174034 91.28056
The cross validation results give more pessimistic scores than evaluating on the training data, partly because each fold's model is trained on a smaller amount of data.
The final model built by cross validation is fitted on the full training set, so its test-set results are the same as with our first model.
pred2 = predict(model_lm_cv,uni[test,])
rmse2 <- RMSE(uni$Apps[test],pred2)
r22 <- R2(uni$Apps[test],pred2)
sprintf('RMSE = %f',rmse2)
## [1] "RMSE = 1239.317749"
sprintf('r-squared = %f',r22)
## [1] "r-squared = 0.864277"
We will use the glmnet library to fit Ridge and Lasso regression models onto our training data.
Based on the cross validation results we will choose the best lambda coefficient for our model.
We will also plot how the number of variables with a nonzero coefficient and the mean squared error (MSE) are affected by the lambda value.
library(glmnet)
grid=10^seq(10,-2,length=100)
x=model.matrix(Apps~.,uni)[,-2]
y=uni$Apps
ridge.mod=glmnet(x[train,],y[train],alpha=0,lambda=grid, thresh=1e-12)
ridge_cv.out=cv.glmnet(x[train,],y[train],alpha=0)
plot(ridge_cv.out)
On the chart, the bottom X axis shows the logarithm of lambda.
The top X axis shows the number of variables (predictors) with nonzero coefficients.
On the Y axis we can observe the mean squared error of the model.
Notice at the top of the plot that even as lambda grows very large, all variables keep nonzero coefficients. This is normal for ridge regression, but we would expect it to be different for lasso regression.
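We can check this directly by counting how many coefficients remain nonzero at the largest lambda in our grid (10^10):

```r
# Ridge coefficients at the largest lambda in the grid;
# ridge shrinks coefficients toward zero but does not zero them out
ridge_coefs <- predict(ridge.mod, type = "coefficients", s = grid[1])
print(sum(ridge_coefs != 0))
```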
Now we will find the best lambda value:
bestlam_ridge = ridge_cv.out$lambda.min
print(bestlam_ridge)
## [1] 422.9006
We will also evaluate the model on our testing data according to R-squared and RMSE metrics:
ridge_pred=predict(ridge.mod,s=bestlam_ridge,newx=x[test,])
ridge_rmse = RMSE(y[test],ridge_pred)
ridge_r2 = R2(y[test],ridge_pred)[1]
sprintf('RMSE = %f',ridge_rmse)
## [1] "RMSE = 1095.329008"
sprintf('r-squared = %f',ridge_r2)
## [1] "r-squared = 0.888471"
Now let’s fit a Lasso model and choose the best lambda value using cross validation.
lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
lasso_cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(lasso_cv.out)
bestlam=lasso_cv.out$lambda.min
print(bestlam)
## [1] 23.1
lasso_pred=predict(lasso.mod,s=bestlam,newx=x[test,])
lasso_rmse = RMSE(y[test],lasso_pred)
lasso_r2 = R2(y[test],lasso_pred)[1]
sprintf('RMSE = %f',lasso_rmse)
## [1] "RMSE = 1105.637421"
sprintf('r-squared = %f',lasso_r2)
## [1] "r-squared = 0.888094"
We will finally compare the three models we tested:
- Normal Linear Multiple Regression model
- Ridge regression
- Lasso Regression
name = c("Linear Model", "Ridge", "Lasso")
rmse_list = c(lm_rmse,ridge_rmse, lasso_rmse)
r2_list = c(lm_r2,ridge_r2,lasso_r2)
df_res = data.frame(name, rmse_list,r2_list )
print(df_res)
## name rmse_list r2_list
## 1 Linear Model 1239.318 0.8642769
## 2 Ridge 1095.329 0.8884710
## 3 Lasso 1105.637 0.8880939
barplot(height = df_res$rmse_list, names.arg = df_res$name, xlab = "Model", ylab = "Root Mean Squared Error")
barplot(height =df_res$r2_list, names.arg = df_res$name, xlab = "Model", ylab = "R-squared")
We can observe that the Ridge and Lasso regression models had the best R-squared and RMSE scores, and their scores were very similar.
The plain linear model had a slightly lower R-squared value and a higher RMSE.
We can conclude that ridge and lasso regularization did improve the performance of our model.
Overall, with an R-squared value of about 0.89 our model can explain close to 90% of the variance, which is a solid score.
The RMSE of roughly 1100 is somewhat large considering that the average number of applications a university receives is 3002, but RMSE does penalize large deviations more heavily.
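To put the RMSE in perspective, we can express it as a fraction of the mean number of applications:

```r
# Relative error: test RMSE as a fraction of the average response
print(ridge_rmse / mean(uni$Apps))
```

With a test RMSE near 1100 and a mean of about 3002 applications, this ratio comes out to roughly a third.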
We cannot perfectly predict the number of applications, but we do have a solid model.